Virtual device composition in a scalable input/output (I/O) virtualization (S-IOV) architecture

ABSTRACT

Examples may include a method of instantiating a virtual machine; instantiating a virtual device to transmit data to and receive data from assigned resources of a shared physical device; and assigning the virtual device to the virtual machine, the virtual machine to transmit data to and receive data from the physical device via the virtual device.

CROSS-REFERENCE TO RELATED APPLICATION

This application claims the benefit of U.S. Provisional Application No. 62/721,483, filed Aug. 22, 2018.

BACKGROUND

The introduction of the Single Root I/O Virtualization (SR-IOV) and Sharing specification, version 1.1, published Jan. 20, 2010 by the Peripheral Component Interconnect (PCI) Special Interest Group (PCI-SIG), was a notable advancement toward hardware-assisted high performance I/O virtualization and sharing for PCI Express devices. PCI Express (PCIe) is defined by PCI Express Base Specification, revision 4.0, version 1.0, published Oct. 5, 2017. Since then, the compute landscape has evolved beyond deploying virtual machines (VMs) for computer server consolidation to hyper-scale data centers which need to seamlessly add resources and dynamically provision containers. The new computing environment demands increased scalability and flexibility for I/O virtualization.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates an example computing system.

FIG. 2 illustrates an example diagram of some high-level differences between the SR-IOV and Scalable I/O virtualization (IOV) architectures.

FIG. 3 illustrates an example diagram of some differences between SR-IOV capable and Scalable IOV capable endpoint devices.

FIG. 4 illustrates an example diagram of a high-level software architecture for Scalable IOV.

FIG. 5 illustrates an example diagram of a logical view of Assignable Device Interfaces (ADIs) with varying numbers of device backend resources, and virtualization software composing virtual device (VDEV) instances with one or more ADIs.

FIG. 6 illustrates a diagram of an example Scalable IOV Designated Vendor Specific Extended Capability (DVSEC) structure.

FIG. 7 illustrates an example high-level translation structure organization for scalable mode address translation.

FIG. 8 illustrates an example diagram of virtual device composition.

FIG. 9 illustrates an example flow diagram of virtual device composition.

FIG. 10 illustrates an example diagram of virtual device composition using physical function ADIs and Interrupt Message Storage (IMS) resources.

FIG. 11 illustrates an example flow diagram of virtual device resource modification.

FIG. 12 illustrates an example flow diagram of virtual device decomposition.

FIG. 13 illustrates an example of a storage medium.

FIG. 14 illustrates another example computing platform.

DETAILED DESCRIPTION

Embodiments of the present invention disclose a Scalable I/O virtualization (Scalable IOV) architecture and associated host computing platform and endpoint device capabilities. Scalable IOV defines a scalable and flexible approach to hardware-assisted I/O virtualization targeting hyper-scale usages. Scalable IOV builds on an already existing set of PCI Express capabilities, enabling the Scalable IOV architecture to be easily supported by compliant PCI Express endpoint device designs and existing software ecosystems.

Virtualization allows system software called a virtual machine monitor (VMM), also known as a hypervisor, to create multiple isolated execution environments called virtual machines (VMs) in which operating systems (OSs) and applications can run. Virtualization is extensively used in modern enterprise and cloud data centers as a mechanism to consolidate multiple workloads onto a single physical machine while still keeping the workloads isolated from each other. Besides VMs, containers provide another type of isolated environment that is used to package and deploy applications and run them in an isolated processing environment. Containers are constructed either as bare-metal containers that are instantiated as OS process groups or as machine containers that utilize the increased isolation properties of hardware support for virtualization. Containers are lighter weight and can be deployed in much higher density than VMs, potentially increasing the number of container instances on a computing platform by an order of magnitude.

Modern processors provide features to reduce virtualization overhead that may be utilized by VMMs to allow VMs direct access to hardware resources. Intel® Virtualization Technology (Intel® VT) for IA-32 Intel® Architecture (Intel® VT-x) defines the Intel® processor hardware capabilities to reduce overheads for processor and memory virtualization. Intel® Virtualization Technology (Intel® VT) for Directed I/O (Intel® VT-d) defines the computing platform hardware features for direct memory access (DMA) and interrupt remapping and isolation that can be utilized to minimize overheads of I/O virtualization. I/O virtualization refers to the virtualization and sharing of I/O devices across multiple VMs or container instances. There are multiple approaches to I/O virtualization that may be broadly classified as either software-based or hardware-assisted.

With software-based I/O virtualization, the VMM exposes a virtual device (such as network interface controller (NIC) functionality, for example) to a VM. A software device model in the VMM or host OS emulates the behavior of the virtual device. The software device model translates virtual device commands to physical device commands before forwarding the commands to the physical device. Such software emulation of devices can provide compatibility to software running within VMs but incurs significant performance overhead, especially for high performance devices. In addition to the performance limitations, emulating virtual device accesses in software can be too complex for programmable devices such as Graphics Processing Units (GPUs) and Field-Programmable Gate Arrays (FPGAs) because these devices perform a variety of functions versus only a fixed set of functions. Variants of software-based I/O virtualization such as ‘device paravirtualization’ and ‘mediated pass-through’ allow the computing platform to mitigate some of the performance and complexity disadvantages with device emulation.

To avoid the software-based I/O virtualization overheads, VMMs may make use of platform support for DMA and interrupt remapping capability (such as Intel® VT-d) to support ‘direct device assignment’, allowing guest software to directly access the assigned device. This direct device assignment provides the best I/O virtualization performance since the VMM is no longer in the way of most guest software accesses to the device. However, this approach requires the device to be exclusively assigned to a VM and does not support sharing of the device across multiple VMs.

Single Root I/O Virtualization (SR-IOV) is a PCI-SIG defined specification for hardware-assisted I/O virtualization that defines a standard way for partitioning endpoint devices for direct sharing across multiple VMs or containers. An SR-IOV capable endpoint device may support one or more Physical Functions (PFs), each of which may support multiple Virtual Functions (VFs). The PF functions as the resource management entity for the device and is managed by a PF driver in the host OS. Each VF can be assigned to a VM or container for direct access. SR-IOV is supported by multiple high performance I/O devices such as network and storage controller devices as well as programmable or reconfigurable devices such as GPUs, FPGAs and other emerging accelerators.

In some embodiments, SR-IOV is implemented using PCIe. In other embodiments, interconnects other than PCIe may be used.

As hyper-scale computing models proliferate along with an increasing number of processing elements (e.g., processing cores) on modern processors, a high-volume computing platform (e.g., computer server) is used to host an order of magnitude more bare-metal or machine containers than traditional VMs. Many of these usages, such as network function virtualization (NFV) or heterogeneous computing with accelerators, require high performance hardware-assisted I/O virtualization. These dynamically provisioned high-density usages (i.e., on the order of 1,000 domains) demand more scalable and fine-grained I/O virtualization solutions than are provided by traditional virtualization usages supported by SR-IOV capable devices.

Scalable IOV as described in embodiments of the present invention herein is a new approach to hardware-assisted I/O virtualization that enables highly scalable and high-performance sharing of I/O devices across isolated domains, while containing the cost and complexity for endpoint device hardware to support such scalable sharing. Depending on the usage model, the isolated domains may be traditional VMs, machine containers, bare-metal containers, or application processes. Embodiments of the present invention primarily refer to isolated domains as VMs, but the general principles apply broadly to other domain abstractions such as containers.

FIG. 1 illustrates an example computing system 100 supporting the Scalable IOV architecture. As shown in FIG. 1, computing system 100 includes a computing platform 101 coupled to a network 170 (which may be the Internet, for example, or a network within a data center). In some examples, as shown in FIG. 1, computing platform 101 is coupled to network 170 via network communication channel 175 and through at least one network I/O device 110 (e.g., a network interface controller (NIC)) having one or more ports connected or coupled to network communication channel 175. In an embodiment, network I/O device 110 is an Ethernet NIC. Network I/O device 110 transmits data packets from computing platform 101 over network 170 to other destinations and receives data packets from other destinations for forwarding to computing platform 101.

According to some examples, computing platform 101, as shown in FIG. 1, includes circuitry 120, primary memory 130, network (NW) I/O device driver 140, operating system (OS) 150, virtual machine manager (VMM) 180 (also known as a hypervisor), at least one application 160, and one or more storage devices 165. In one embodiment, OS 150 is Linux™. In another embodiment, OS 150 is Windows® Server. In an embodiment, application 160 represents one or more application programs executed by one or more guest VMs (not shown). Network I/O device driver 140 operates to initialize and manage I/O requests performed by network I/O device 110. In an embodiment, packets and/or packet metadata transmitted to network I/O device 110 and/or received from network I/O device 110 are stored in one or more of primary memory 130 and/or storage devices 165. In at least one embodiment, storage devices 165 may be one or more of hard disk drives (HDDs) and/or solid-state drives (SSDs). In an embodiment, storage devices 165 may be non-volatile memories (NVMs). In some examples, as shown in FIG. 1, circuitry 120 may communicatively couple to network I/O device 110 via communications link 155. In one embodiment, communications link 155 is a Peripheral Component Interconnect Express (PCIe) bus conforming to revision 4.0 or other versions of the PCIe standard. In some examples, operating system 150, NW I/O device driver 140, and application 160 are implemented, at least in part, via cooperation between one or more memory devices included in primary memory 130 (e.g., volatile or non-volatile memory devices), storage devices 165, and elements of circuitry 120 such as processing cores 122-1 to 122-m, where “m” is any positive whole integer greater than 2. In an embodiment, OS 150, NW I/O device driver 140, and application 160 are executed by one or more processing cores 122-1 to 122-m. In other embodiments, there are other endpoint devices coupled to communications link 155 (e.g., PCIe interconnect) that support Scalable IOV capabilities.

In some examples, computing platform 101 includes but is not limited to a computer server, a server array or server farm, a web server, a network server, an Internet server, a workstation, a mini-computer, a mainframe computer, a supercomputer, a network appliance, a web appliance, a distributed computing system, a multiprocessor system, a processor-based system, a laptop computer, a tablet computer, a smartphone, or a combination thereof. In one example, computing platform 101 is a disaggregated server. A disaggregated server is a server that breaks up components and resources into subsystems and connects them through network connections (e.g., network sleds). Disaggregated servers can be adapted to changing storage or compute loads as needed without replacing or disrupting an entire server for an extended period of time. A server could, for example, be broken into modular compute, I/O, power and storage modules that can be shared among other nearby servers.

Circuitry 120 having processing cores 122-1 to 122-m may include various commercially available processors, including without limitation Intel® Atom®, Celeron®, Core (2) Duo®, Core i3, Core i5, Core i7, Itanium®, Pentium®, Xeon® or Xeon Phi® processors, ARM processors, and similar processors. Circuitry 120 may include at least one cache 135 to store data.

According to some examples, primary memory 130 may be composed of one or more memory devices or dies which may include various types of volatile and/or non-volatile memory. Volatile types of memory may include, but are not limited to, dynamic random-access memory (DRAM), static random-access memory (SRAM), thyristor RAM (TRAM) or zero-capacitor RAM (ZRAM). Non-volatile types of memory may include byte or block addressable types of non-volatile memory having a 3-dimensional (3-D) cross-point memory structure that includes chalcogenide phase change material (e.g., chalcogenide glass), hereinafter referred to as “3-D cross-point memory”. Non-volatile types of memory may also include other types of byte or block addressable non-volatile memory such as, but not limited to, multi-threshold level NAND flash memory, NOR flash memory, single or multi-level phase change memory (PCM), resistive memory, nanowire memory, ferroelectric transistor random access memory (FeTRAM), magneto-resistive random-access memory (MRAM) that incorporates memristor technology, spin transfer torque MRAM (STT-MRAM), or a combination of any of the above. In another embodiment, primary memory 130 may include one or more hard disk drives within and/or accessible by computing platform 101.

FIG. 2 illustrates an example diagram 200 of some high-level differences between the SR-IOV 202 and Scalable IOV 222 architectures. Unlike the coarse-grained device partitioning and assignment approach 208 adopted by SR-IOV to create multiple VFs 214 on a PF 212, the Scalable IOV architecture 222 enables software to flexibly compose virtual devices utilizing hardware assists for device sharing at finer granularity. Frequent (i.e., performance critical) operations on the composed virtual device are mapped directly to the underlying device hardware (e.g., Scalable IOV device 230), while infrequent operations are emulated through device-specific composition software 226 in VMM/Host OS 228. This is different from the existing architecture for SR-IOV devices 202, where only the device-agnostic PCI Express architectural resources (such as configuration space registers and message signaled interrupt extended (MSI-X) capability registers) of the virtual device are virtualized in software, and the rest of the virtual device resources (including all other memory mapped I/O (MMIO)) are mapped directly to the underlying VF 214 hardware resources (e.g., SR-IOV device 216).

In the SR-IOV architecture 202 using hardware replication, a plurality of VMs and containers 204 run on top of a VMM and/or host OS 206. Device partitioning and assignment logic 208 assigns I/O requests to PF driver 210, which is coupled with physical function (PF) 212 in SR-IOV device 216, or to virtual functions (VFs) 214 in SR-IOV device 216. In contrast, in the Scalable IOV architecture 222 of embodiments of the present invention using replication and composition, many more VMs and containers 224 are supported. VMs and/or containers 224 call device composition logic 226 in VMM/host OS 228 to implement I/O requests. Device composition logic 226 assigns I/O requests to PF driver 210, which is coupled with fine-grained, provisioned device resources (which also include physical functions) in Scalable IOV device 230, or directly to fine-grained, provisioned device resources 232 in Scalable IOV device 230.

The Scalable IOV architecture provides benefits over SR-IOV. Fine-grained provisioning of device resources to VMs 224 along with software emulation of infrequent device accesses enables devices to increase sharing scalability at lower hardware cost and complexity. The Scalable IOV architecture provides system software such as VMM/Host OS 228 the flexibility to share device resources with different address domains using different abstractions (e.g., application 160 processes to access through system calls and VMs/containers 224 to access through virtual device interfaces). Through software-controlled dynamic mapping of virtual devices (VDEVs) to device resources, the Scalable IOV architecture of embodiments of the present invention also enables VMMs to over-provision device resources to VMs 224.

The present approach also enables VMMs 228 to easily maintain generational compatibility in a data center. For example, in a data center with physical machines containing different generations (e.g., versions) of the same I/O device, a VMM can use software emulation to virtualize a VDEV's MMIO-based capability registers to present the same VDEV capabilities irrespective of the different generations of physical I/O device. This is to ensure that the same guest OS image with a VDEV driver can be deployed or migrated to any of the physical machines.

The Scalable IOV architecture is composed of the following elements. The architecture supports PCI Express endpoint device requirements and capabilities. The architecture supports host platform (e.g., Root Complex) requirements, including enhancements to direct memory access (DMA) remapping hardware. In an embodiment, these requirements are implemented on Intel processor-based computing platforms as part of Intel® Virtualization Technology for Directed I/O, Rev 3.0 or higher. The architecture also supports a reference software architecture envisioned for enabling Scalable IOV, including host system software (OS and/or VMM 228) enabling infrastructure and endpoint device specific software components such as a host driver, a guest driver, and a virtual device composition module (VDCM).

PCI Express endpoint devices may support requirements to operate with Scalable IOV independently of their support for SR-IOV. This enables device implementations that already support SR-IOV to maintain this capability for backwards compatibility while adding the additional capabilities to support Scalable IOV.

In embodiments of the present invention, an endpoint physical function capable of both SR-IOV and Intel Scalable IOV may be enabled to operate in one mode or the other, but not both concurrently.

The PCI Express SR-IOV architecture follows a near complete functional hardware replication of the Physical Function (PF) 212 hardware for its Virtual Functions (VFs) 214. This is realized by most SR-IOV device implementations by replicating most of the PF's hardware/software interface for each of its VFs, including resources such as memory mapped resources and MSI-X storage, and capabilities such as Function Level Reset (FLR). Such a functional replication approach can add to device complexity and impose limitations on scaling to large numbers of VFs.

The hardware-software interface for I/O controller implementations can be categorized as (a) slow path control/configuration operations that are less frequent and have the least impact on overall device performance; and (b) fast path command/completion operations that are frequent and have a higher impact on the overall device performance. This distinction of slow path versus fast path operations is practiced by many high performance I/O devices supporting direct user-mode access. The Scalable IOV architecture extends such device designs to define a software composable approach to I/O virtualization and sharing.

The Scalable IOV architecture requires endpoint devices (i.e., Scalable IOV devices 230) to organize their hardware/software interfaces into fast path (frequent) and slow path (infrequent) accesses. Which operations and accesses are distinguished as slow path versus fast path is controlled by device implementation. Slow path accesses typically include initialization, control, configuration, management, error processing, and reset operations. Fast path accesses typically include data path operations involving work submission and work completion processing. With this organization, slow path accesses to the virtual device from a guest VM are trapped and emulated by device-specific host software while fast path accesses are directly mapped onto the physical device. This approach enables simplified device designs (compared to SR-IOV full functional replication) without compromising I/O virtualization scalability or performance. Additionally, the hybrid approach provides increased flexibility for software to compose virtual devices through fine-grained provisioning of device resources.
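
To make the fast path/slow path split concrete, the following C sketch shows one possible per-ADI MMIO layout in which fast path and slow path registers occupy separate system-page-size regions, so a VMM can direct-map the former into a guest while trapping the latter. The register names and the two-page layout are illustrative assumptions, not a layout defined by the Scalable IOV specification.

```c
/* Hypothetical per-ADI MMIO layout: fast path and slow path registers
 * live in separate system-page-size regions so a VMM can map the fast
 * path page into the guest while trapping slow path accesses. */
#include <stdint.h>

#define SYS_PAGE_SIZE 4096u

struct adi_fastpath_page {
    uint64_t work_submit;        /* fast path: work submission doorbell */
    uint64_t completion_tail;    /* fast path: completion queue tail */
    uint8_t  pad[SYS_PAGE_SIZE - 16];
};

struct adi_slowpath_page {
    uint64_t control;            /* slow path: enable/disable, reset */
    uint64_t config;             /* slow path: queue sizes, PASID setup */
    uint64_t error_status;       /* slow path: error reporting */
    uint8_t  pad[SYS_PAGE_SIZE - 24];
};

/* One ADI occupies two aligned pages inside the PF's BAR: the VMM
 * direct-maps page 0 (GPA to HPA) and leaves page 1 unmapped so guest
 * accesses to it fault into device-specific host software for emulation. */
struct adi_mmio {
    struct adi_fastpath_page fast;  /* direct-mapped into the guest */
    struct adi_slowpath_page slow;  /* intercepted and emulated */
};
```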

High performance I/O devices support a large number of command/completion interfaces for efficient multiplexing/de-multiplexing of I/O requests, and in some usages to support user-mode I/O requests. A few examples of such devices are: a) high-bandwidth network controllers supporting thousands of transmit/receive (TX/RX) queues across a large number of Virtual Switch Interfaces (VSIs); b) storage controllers such as NVM Express (as described in the non-volatile memory (NVM) Express specification, version 1.3c, available at nvmexpress.org) devices supporting many command and completion queue pair constructs; c) accelerator devices such as GPUs supporting a large number of graphics and/or compute contexts; d) reconfigurable FPGA devices with Accelerator Functional Units (AFUs) supporting a large number of execution contexts; and e) remote direct memory access (RDMA) capable devices supporting thousands of Queue Pair (QP) interfaces.

The Scalable IOV architecture takes advantage of multi-queue/multi-context capable high performance I/O device designs and defines an approach to share these devices at a finer granularity (queues, queue bundles, contexts, etc.) than SR-IOV VF granularity. To achieve this finer-grained sharing, the Scalable IOV architecture of embodiments of the present invention defines the granularity of sharing of a device as an ‘Assignable Device Interface’ (ADI) on the device. According to an embodiment, an ADI is the unit of assignment for a Scalable IOV capable device. Each ADI instance on the device encompasses the set of resources on the device that are allocated by software to support the fast path operations for a virtual device (VDEV).

Conceptually, an ADI is similar to an SR-IOV virtual function (VF), except it is finer-grained and maps to the fast path operations for a virtual device. Unlike VFs, all ADIs on a Physical Function (PF) share the Requester-ID (e.g., Bus/Device/Function number) of the PF, have no PCI configuration space registers, share the same Base Address Register (BAR) resources of the PF (i.e., no virtual function base address registers (VFBARs)), and do not require replicated MSI-X storage. Instead of MSI-X table storage for each ADI, the PF implements a device specific Interrupt Message Storage (IMS). IMS is similar to MSI-X table storage in purpose but is not architectural and instead is implemented in a device specific manner for maximum flexibility. Additionally, unlike some SR-IOV devices which implement VF⇔PF communication channels and ‘resource remapping logic’ on the device, ADIs use slow-path emulation to provide such functionality. An ADI's memory-mapped register space is laid out such that fast path registers are in separate system page size regions from the slow path registers. The host driver for a Scalable IOV capable device defines the collection of device back-end resources that are grouped to form an ADI.
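
The per-ADI properties in this paragraph can be summarized in a small bookkeeping structure. The following C sketch is a hypothetical host-driver view of one ADI; the field names are assumptions, since the specification leaves this representation to device and driver implementations.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical host-driver bookkeeping for one ADI. */
struct adi {
    uint32_t pasid;          /* 20-bit PASID tagging the ADI's upstream requests */
    /* No per-ADI Requester-ID: DMA uses the PF's Bus/Device/Function. */
    uint64_t mmio_base;      /* page-aligned offset within a PF BAR (no VF BARs) */
    uint32_t mmio_pages;     /* number of system-page-size MMIO regions */
    uint32_t ims_entry[8];   /* indices into the PF's IMS (no per-ADI MSI-X table) */
    uint32_t num_ims;        /* interrupt messages allocated to this ADI */
    bool     active;         /* ADI configured with a PASID and enabled */
};
```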

FIG. 3 illustrates an example diagram 300 of some differences between SR-IOV capable and Scalable IOV capable endpoint devices. SR-IOV device 216 includes physical function (PF) base address registers (PF BARs) 302, PF configuration (PF config) circuitry 304, and PF message signaled interrupt extended (MSI-X) circuitry 306. PF MSI-X 306 provides the MSI-X capability as defined by the PCI Express Base Specification. SR-IOV device 216 also includes a plurality of sets of virtual function (VF) VF BARs 312, VF config 314, and MSI-X 316. PF BARs 302 and VF BARs 312 are coupled to device resource remapping logic and VF⇔PF mailbox logic 308, which calls device backend resources 310. In an embodiment, device backend resources 310 include a plurality of queues for storing packets. Device backend resources 310 may include command/status registers, on-device queues, references to in-memory queues, local memory on the device, or any other device specific internal constructs.

Scalable IOV device 230 includes PF BARs 320, which also include a plurality of ADI MMIO components 322. PF BARs 320 are coupled with PF config 324, PF MSI-X 326, and interrupt message storage (IMS) for ADIs 328. PF MSI-X 326 provides the MSI-X capability as defined by the PCI Express Base Specification. IMS 328 enables devices to store the interrupt messages for ADIs in a device-specific optimized manner without the scalability restrictions of the PCI Express defined MSI-X capability. PF BARs 320 and ADI MMIO components 322 are coupled with device backend resources 330. Device backend resources 330 may include command/status registers, on-device queues, references to in-memory queues, local memory on the device, or any other device specific internal constructs.

The device-specific and light-weight nature of ADIs, along with the flexibility to emulate portions of the virtual device functionality in device-specific host software, enables device hardware implementations to compose a large number of virtual devices for scalable sharing at lower device cost and complexity compared to equivalent scaling of SR-IOV VFs.

With the SR-IOV architecture, each VF 214 in an SR-IOV device 216 is identified by a PCI Express Requester Identifier (RID), allowing DMA remapping hardware support in the Root Complex (such as Intel® VT-d) to apply unique address translation functions for upstream requests from the VF. A RID is a bus, device and function number identity for a PCI Express PF or VF. RIDs are also used for routing transactions such as read completions for the PCI Express device hierarchy, and hence can be a scarce resource on some platform topologies with large I/O fan-out designs. This can impose scalability limitations on the number of isolated domains an SR-IOV device can support.

The Scalable IOV architecture of embodiments of the present invention addresses the platform scalability issue by sharing the RID of the physical function (PF) 232 with all of its ADIs, and instead assigning ADIs a Process Address Space Identifier (PASID) that is conveyed in upstream transactions using a PCI Express PASID transaction layer packet (TLP) Prefix. Refer to the PCI Express specification for details on the PASID TLP Prefix. The computing platform 101 support for the Scalable IOV architecture enables unique address translation functions for upstream requests at PASID granularity. Unlike a RID, a PASID is not used for transaction routing on the I/O fabric but instead is used only to convey the address space targeted by a memory transaction. Additionally, PASIDs are 20-bit IDs compared to 16-bit RIDs, which gives 16× more identifiers. This use of PASIDs by the Scalable IOV architecture enables significantly more domains to be supported by Scalable IOV devices.

FIG. 4 illustrates an example diagram of a high-level software architecture 400 for Scalable IOV. FIG. 4 illustrates components used to describe the Scalable IOV architecture and is not intended to illustrate all virtualization software or specific implementation choices. To support broad types of device classes and implementations, the software responsibilities are abstracted between system software (OS 150/VMM 180) and device-specific driver software components.

Thus, FIG. 4 is a description of system software (host OS 150 and VMM 180) and device-specific software roles and interactions to compose hardware-assisted virtual devices, along with how device operations are managed. The software architecture described is focused on I/O virtualization for virtual machines and machine containers. However, the principles can be applied, with appropriate software support, to other domains such as I/O sharing across bare-metal containers or application processes.

The Scalable IOV architecture of embodiments of the present invention introduces a device-specific software component referred to as the Virtual Device Composition Module (VDCM) 402 that is responsible for composing one or more virtual device (VDEV) 404 instances utilizing one or more Assignable Device Interfaces (ADIs) 406, 408, which the VDCM does by emulating VDEV slow path operations/accesses and mapping the VDEV fast path accesses to ADI instances allocated and configured on the physical device. Unlike SR-IOV VFs 214, VDCM 402 allows Scalable IOV devices 230 to avoid implementing slow path operations in hardware and instead to focus device hardware on efficiently scaling the ADIs.

Additionally, virtualization management software (e.g., a VMM 180) makes use of VDCM 402 software interfaces for enhanced virtual device resource and state management, enabling capabilities such as suspend, resume, reset, and migration of virtual devices. Depending on the specific VMM implementation, VDCM 402 is instantiated as a separate user or kernel module or may be packaged as part of a host driver.

Host driver 412 for a Scalable IOV capable device 230 is conceptually equivalent to an SR-IOV PF driver 210. Host driver 412 is loaded and executed as part of host OS 150 or VMM (hypervisor) software 180. In addition to the role of a normal device driver, host driver 412 implements software interfaces as defined by host OS 150 or VMM 180 infrastructure to support enumeration, configuration, instantiation, and management of a plurality of ADIs 428, 430, 432, 434. Host driver 412 is responsible for configuring each ADI, such as its PASID identity, device-specific Interrupt Message Storage (IMS) 328 for storing the ADI's interrupt messages, MMIO register resources 322 for fast-path access to the ADI, and any device-specific resources.

Table 1 illustrates an example high-level set of operations that host driver 412 supports for managing ADIs. These operations are invoked through software interfaces defined by specific system software (e.g., host OS 150 or VMM 180) implementations.

TABLE 1
Host driver interfaces for Scalable IOV

Scalable IOV capability reporting for the PF.
Enumeration of types and maximum number of ADIs/VDEVs.
Enumeration of resource requirements for each ADI type.
Enumeration and setting of deployment compatibility for ADIs.
Allocation, configuration, reset, drain, abort, and release of an ADI and its constituent resources.
Setting and managing the PASID identity of ADIs.
Managing device-specific Interrupt Message Storage (IMS) for ADIs.
Enabling a guest to host communication channel (if supported).
Configuring device-specific QoS properties of ADIs.
Enumerating and managing migration compatibility of ADIs.
Suspending/saving state of ADIs, and restoring/resuming state of ADIs.
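
As a rough illustration of how the Table 1 operations might surface to system software, the following C sketch models them as a host-driver operations table. The actual interface is defined by each host OS or VMM implementation; every name and signature here is an assumption.

```c
/* A minimal sketch of the Table 1 operations as a host-driver ops table. */
#include <stdint.h>
#include <stddef.h>

struct adi;                      /* opaque ADI handle */
struct adi_state;                /* opaque saved-state blob */

struct siov_host_driver_ops {
    int      (*report_capability)(void);             /* Scalable IOV support on the PF */
    uint32_t (*max_adis)(uint32_t adi_type);         /* types and maximum number of ADIs */
    int      (*adi_alloc)(uint32_t adi_type, struct adi **out);
    int      (*adi_reset)(struct adi *a);
    int      (*adi_drain)(struct adi *a);            /* complete in-flight operations */
    int      (*adi_abort)(struct adi *a);            /* discard in-flight operations */
    void     (*adi_release)(struct adi *a);
    int      (*adi_set_pasid)(struct adi *a, uint32_t pasid);
    int      (*ims_alloc)(struct adi *a, uint64_t msg_addr, uint32_t msg_data);
    int      (*ims_free)(struct adi *a, int ims_index);
    int      (*adi_set_qos)(struct adi *a, const void *dev_qos, size_t len);
    int      (*adi_suspend_save)(struct adi *a, struct adi_state *out);
    int      (*adi_restore_resume)(struct adi *a, const struct adi_state *in);
};
```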

Virtual Device Composition Module (VDCM) 402 is a device-specific component responsible for composing one or more virtual device (VDEV) 404 instances using one or more ADIs 406, 408 allocated by host driver 412. VDCM 402 implements software-based virtualization of VDEV 404 slow path operations and arranges for fast path operations to be submitted directly to the backing ADIs 428, 430, 432, 434. Host OS 150 or VMM 180 implementations supporting such hardware-assisted virtual device composition may require the VDCM to be implemented and packaged by device vendors in different ways. For example, in some OS or VMM implementations, VDCM 402 is packaged as user-space modules or libraries that are installed as part of the device's host driver 412. In other implementations, VDCM 402 is a kernel module. If implemented as a library, VDCM 402 may be statically or dynamically linked with the VMM-specific virtual machine resource manager (VMRM) responsible for creating and managing VM resources. If implemented in the host OS kernel, VDCM 402 can be part of host driver 412.

Guest driver 424 for a Scalable IOV capable device 230 is conceptually equivalent to an SR-IOV device VF driver. In an embodiment, guest driver 424, resident in guest VM 422, manages VDEV instances 404 composed by VDCM 402. Fast path accesses 426 by guest driver 424 are issued directly to ADIs 432, 434 behind VDEV 404, while slow path accesses 420 are intercepted and virtualized by VM resource manager (VMRM) 416 and VDCM 402. Similar to the implementation choices available for SR-IOV PF 212 and VF 214 drivers, for a target OS 150, guest driver 424 is deployed as a separate driver or as a unified driver that supports both host OS 150 and guest VM 422 functionality. For existing SR-IOV devices 216, if VDEV 404 is composed to behave like an existing VF 214, Scalable IOV guest driver 424 can even be the same as the SR-IOV VF 214 driver for backward compatibility.

In embodiments of the present invention, Virtual Device (VDEV) 404 is the abstraction through which a shared physical device (e.g., Scalable IOV device 230) is exposed to software in guest VM 422. VDEVs 404 are exposed to guest VM 422 as virtual PCI Express enumerated devices, with virtual resources such as virtual Requester-ID, virtual configuration space registers, virtual memory BARs, virtual MSI-X table, etc. Each VDEV 404 may be backed by one or more ADIs 428, 430, 432, 434. The ADIs backing a VDEV 404 typically belong to the same PF 232, but implementations are possible where they are allocated across multiple PFs (for example, to support device fault tolerance or load balancing).

A PF 232 may support multiple types of ADIs, both in terms of number of device backend resources 330 and in terms of functionality. Similarly, multiple types of VDEV compositions are possible (with respect to the number of backing ADIs, functionality of ADIs, etc.) on a Scalable IOV device 230. VDCM 402 may publish support for composing multiple ‘VDEV types’, enabling a virtual machine resource manager (VMRM) 416 to request different types of VDEV instances for assigning to virtual machines (VMs). VDCM 402 uses host OS 150 and VMM 180 defined interfaces to allocate and configure resources needed to compose a plurality of VDEV 404 instances. VDEV instances may be assigned to VMs 422 in the same way as SR-IOV VFs 214.

VDEV 404 may be composed of a static number of ADIs that are pre-allocated at the time of VDEV instantiation, or composed dynamically by VDCM 402 in response to guest driver 424 requests to allocate/free resources. An example of statically allocated ADIs is a virtual NIC (vNIC) with a fixed number of RX/TX queues. An example of dynamically allocated ADIs is a virtual accelerator device, where context allocation requests are virtualized by VDCM 402 to dynamically create accelerator contexts as ADIs.

A VDEV's MMIO registers 322 may be composed with any of the following methods, applied per system page size region of the VDEV MMIO space.

1) Direct Mapped to ADI MMIO. As part of composing a VDEV instance 404, VDCM 402 defines the system page size ranges in VDEV virtual BARs in guest physical address (GPA) space that need to be mapped to MMIO page ranges of backing ADIs in host physical address (HPA) space. VDCM 402 may request VMM 180 to set up GPA to HPA mappings in the host processor 122-1 to 122-m virtualization page tables, enabling direct access by guest driver 424 to the ADI, as illustrated in the sketch following these methods. These direct mapped MMIO ranges support fast path operations 426 to ADIs 432, 434.

2) VDEV MMIO Intercepted and Emulated by VDCM. Slow path registers for a VDEV are virtualized by VDCM 402 by requesting VMM 180 not to map these MMIO regions 322 in the host processor virtualization page tables, thus forcing host intercepts when guest driver 424 accesses these registers (also shown in the sketch following these methods). These intercepts are provided to the VDCM module 402 composing the VDEV instance 404, so that VDCM 402 may virtualize such intercepted accesses by itself or through interactions with host driver 412. To minimize the software complexity of slow path access emulation, host OS 150 or virtualization providers may restrict guest drivers 424 to use simple memory move operations of eight bytes or less to access the VDEV's slow path MMIO resources. VDEV registers that are read frequently and have no read side-effects, but require VDCM intercept and emulation on write accesses, may be mapped as read-only to backing memory pages provided by VDCM. This supports high performance read accesses to these registers along with virtualizing their write side-effects by intercepting guest write accesses. ‘Write intercept only’ registers must be hosted in separate system page size regions from the ‘read-write intercept’ registers in the VDEV MMIO layout.

3) VDEV MMIO 322 Mapped to Memory 130. VDEV registers that have no read or write side effects may be mapped to primary memory 130 with read and write access. These registers may contain parameters or data for a subsequent operation performed by writing to an intercepted register. Device implementations may also use this approach to define virtual registers for VDEV-specific communication channels between guest driver 424 and VDCM 402. Guest driver 424 writes data to the memory-backed virtual registers without host intercepts, followed by a mailbox register access that is intercepted by the VDCM. This optimization reduces host intercept and instruction emulation cost for passing data between guest and host. Such an approach enables guest drivers 424 to implement such channels with the VDCM more generally than hardware-based communication doorbells (as is often implemented between SR-IOV VFs 214 and PF 212) or without depending on guest OS or VMM specific para-virtualized software interfaces.
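
The following C sketch illustrates methods 1 and 2 above from the VDCM's perspective: fast path pages are direct-mapped GPA to HPA, while slow path pages are left unmapped so guest accesses fault into an emulation handler. The VMM functions vmm_map_gpa() and vmm_register_mmio_handler() are hypothetical stand-in APIs, not interfaces defined by any particular VMM.

```c
/* A minimal sketch, under assumed VMM APIs, of VDEV MMIO composition. */
#include <stdint.h>

#define SYS_PAGE_SIZE 4096u

struct vdev;
int vmm_map_gpa(uint64_t gpa, uint64_t hpa, uint64_t len, int prot);  /* assumed */
int vmm_register_mmio_handler(struct vdev *v, uint64_t gpa,
                              uint64_t len);                          /* assumed */

int vdcm_compose_vdev_mmio(struct vdev *v, uint64_t vdev_bar_gpa,
                           const uint64_t *adi_fast_hpa, unsigned n_fast,
                           unsigned n_slow_pages)
{
    /* Method 1: direct-map each ADI fast path page; guest fast path
     * accesses then reach device hardware with no VMM intercept. */
    for (unsigned i = 0; i < n_fast; i++) {
        int rc = vmm_map_gpa(vdev_bar_gpa + (uint64_t)i * SYS_PAGE_SIZE,
                             adi_fast_hpa[i], SYS_PAGE_SIZE,
                             /*prot=*/3 /* read|write */);
        if (rc)
            return rc;
    }

    /* Method 2: leave slow path pages unmapped and register an intercept
     * handler; the VDCM emulates these accesses (guests being limited to
     * simple memory moves of eight bytes or less). */
    return vmm_register_mmio_handler(v,
            vdev_bar_gpa + (uint64_t)n_fast * SYS_PAGE_SIZE,
            (uint64_t)n_slow_pages * SYS_PAGE_SIZE);
}
```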

VDEVs 404 expose a virtual MSI or virtual MSI-X capability that is emulated by VDCM 402. Guest driver 424 requests VDEV interrupt resources normally through guest VM 422 interfaces, and the guest VM may service this by programming one or more Interrupt Messages through the virtual MSI or virtual MSI-X capability of VDEV 404.

For typical virtual device compositions, there are two sources of interrupts delivered as VDEV interrupts to guest driver 424. One source is VDCM software 402 itself, which may generate virtual interrupts on behalf of the VDEV to be delivered to the guest driver. These are software generated interrupts by the slow path operations of the VDEV emulated by the VDCM. The other source of interrupts is ADI instances 432, 434 on the device that are used to support fast path operations of VDEV 404. ADI generated interrupts use interrupt messages stored in Interrupt Message Storage (IMS) 328.

When guest VM 422 programs the virtual MSI or MSI-X register, the operation is intercepted and virtualized by VDCM 402. For slow path virtual interrupts, the VDCM requests virtual interrupt injection to the guest through the VMM 180 software interfaces. For fast path interrupts from ADIs, the VDCM invokes host driver 412 to allocate and configure the required interrupt message address and data in the IMS. This is conceptually similar to how MSI-X interrupts for SR-IOV VFs are virtualized by some virtualization software, except the interrupt messages are programmed in the IMS by host driver 412 as opposed to in an MSI-X table by a PCI driver.
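
A minimal sketch of that virtual MSI-X interception logic follows, assuming hypothetical helper functions: vectors backed by ADIs are programmed into the device's IMS through the host driver, while purely software vectors are registered with the VMM for virtual interrupt injection.

```c
#include <stdint.h>
#include <stdbool.h>

struct vdev;
bool vector_backed_by_adi(struct vdev *v, unsigned vec);           /* assumed */
int  vmm_register_virtual_irq(struct vdev *v, unsigned vec,
                              uint64_t addr, uint32_t data);       /* assumed */
int  host_driver_ims_program(struct vdev *v, unsigned vec,
                             uint64_t addr, uint32_t data);        /* assumed */

int vdcm_virtual_msix_write(struct vdev *v, unsigned vec,
                            uint64_t msg_addr, uint32_t msg_data)
{
    if (vector_backed_by_adi(v, vec))
        /* Fast path interrupt: the host driver writes the message into
         * IMS so the ADI raises it in hardware. */
        return host_driver_ims_program(v, vec, msg_addr, msg_data);

    /* Slow path interrupt: the VDCM injects it via VMM software interfaces. */
    return vmm_register_virtual_irq(v, vec, msg_addr, msg_data);
}
```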

For device-specific usages and reasons, Scalable IOV capable devices 230 may choose to build communication channels between guest driver 424 and VDCM 402. These communication channels can be built in a guest and host system software agnostic manner with either of the methods below.

1) Software emulated communication channel. Such a channel is composed by VDCM 402 using one or more system page size regions in VDEV MMIO space set up as fully memory-backed to enable sharing of data between guest VM 422 and host OS 150. A host intercepted system page size region in VDEV MMIO space is also set up to signal a guest action to the host; a sketch of the guest side of such a channel follows these methods. Optionally, a virtual interrupt may also be set up by the VDCM to signal the guest about completion of asynchronous communication channel actions.

2) Hardware mailbox-based communication channel. If the communication between guest driver 424 and host driver 412 is frequent and the software emulation-based communication channel overhead is significant, Scalable IOV device 230 may implement communication channels based on hardware mailboxes. This is similar to communication channels between SR-IOV VFs 214 and PF 212 in some existing designs.
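
The guest-driver side of a software emulated channel (method 1 above) might look like the following C sketch: parameters are staged through memory-backed virtual registers with plain stores, then a single write to an intercepted mailbox register signals the VDCM. The register offsets and layout are illustrative assumptions.

```c
#include <stdint.h>

#define CHAN_DATA_OFFSET   0x0000  /* memory-backed page: read/write, no traps */
#define CHAN_DOORBELL_OFF  0x1000  /* separate page: intercepted by the VDCM */

static inline void mmio_write32(volatile void *base, uint64_t off, uint32_t v)
{
    *(volatile uint32_t *)((volatile uint8_t *)base + off) = v;
}

void guest_send_request(volatile void *vdev_mmio,
                        const uint32_t *payload, unsigned dwords)
{
    /* Stage the payload with plain stores; these hit backing memory
     * directly, so no host intercept or instruction emulation cost is paid. */
    for (unsigned i = 0; i < dwords; i++)
        mmio_write32(vdev_mmio, CHAN_DATA_OFFSET + 4u * i, payload[i]);

    /* One intercepted write tells the VDCM the staged request is ready. */
    mmio_write32(vdev_mmio, CHAN_DOORBELL_OFF, 1);
}
```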

Shared Virtual Memory (SVM) refers to usages where a device operates in the CPU virtual address space of the applications sharing the device. SVM usage is enabled by system software programming the DMA remapping hardware to reference the CPU page tables for requests with a PASID representing the target application's virtual address space. Devices supporting such SVM capability do not require pages that are accessed by the device to be pinned, and instead support PCI Express Address Translation Services (ATS) and Page Request Service (PRS) capabilities to support recoverable device page faults. Refer to the PCI Express specification for details on ATS and PRS capabilities.

A device supporting the Scalable IOV architecture can independently support SVM usages on ADIs allocated to host applications or on ADIs allocated to guest applications through the VDEV instance assigned to guest VM 422. Both the host and guest SVM usages are transparent to the ADI operation. One difference is in the address translation function programming of the Root Complex DMA remapping hardware. The address translation function programmed for PASIDs representing host SVM usage refers to the respective CPU virtual address to physical address translation, while the address translation function programmed for PASIDs representing guest SVM usage refers to the respective nested address (guest virtual address to guest physical address and further to host physical address) translation.

A set of requirements and capabilities for an endpoint device to support the Scalable IOV architecture will now be described. The requirements apply to both Root-Complex Integrated Endpoint (RCIEP) and PCI Express Endpoint (PCIEP) devices. In an embodiment, the endpoint device may be a NIC, a storage controller, a GPU, an FPGA, an application specific integrated circuit (ASIC), or other circuitry.

As described previously, the Scalable IOV architecture defines the constructs for fine-grained sharing on endpoint devices (i.e., Scalable IOV devices 230) as Assignable Device Interfaces (ADIs) 428, 430, 432, 434. ADIs form the unit of assignment and isolation for Scalable IOV capable devices 230 and are composed by software to form virtual devices. The requirements for endpoint devices for enumeration, allocation, configuration, management and isolation of ADIs are as follows.

Resources on an endpoint device associated with fast path work submission, execution and completion operations are referred to as device backend resources.

Assignable Device Interfaces (ADIs) 428, 430, 432, 434 refer to a set of device backend resources 330 that are allocated, configured and organized as an isolated unit, forming the unit of device sharing. The type and number of backend resources grouped to compose an ADI are device specific. For example, for a network controller device (such as an Ethernet NIC), an ADI may be composed of a set of TX/RX queues and resources associated with a Virtual Switch Interface (VSI). An ADI on a storage controller may be the set of command queues and completion queues associated with a storage namespace. Similarly, an ADI on a GPU may be organized as a set of graphics or compute contexts created on behalf of a virtual-GPU device instance. Depending on the design, an ADI on an FPGA device may be an entire Accelerator Function Unit (AFU) or a context on a multi-context capable AFU.

The SR-IOV architecture specifies the allocation of PCI Express architectural resources through the VF construct but leaves it to device implementations how the device backend resources are allocated and associated with specific VFs 214. Devices that want to flexibly provision a variable number of backend resources to VFs 214 (e.g., one queue-pair to a first VF and one queue-pair to another VF) need to implement another level of ‘resource remapping logic’ (as shown at block 308 in FIG. 3) within the endpoint device to map which device backend resources 310 are accessible through specific VFs 214 and isolated from access through other VFs. Such resource remapping logic 308 in the endpoint device increases device complexity as the number of VFs and backend resources is scaled.

The SR-IOV software architecture provides for a virtual device instance to be composed of a single VF, whereas the Scalable IOV software architecture of embodiments of the present invention allows software to compose a virtual device (VDEV) instance through the use of one or more ADIs. This enables endpoint device hardware designs to avoid the need for complex resource remapping logic internal to the endpoint device.

For example, consider a device that uses queue-pairs (QPs) 436, 438, 440, 442 as backend resources 330, and a VM 422 that needs eight QPs in the VDEV for its workloads. In the SR-IOV architecture, designs will have to map a VF 214 to eight QPs, with either static partitioning of eight QPs per VF, or dynamic partitioning of eight QPs to a VF using resource remapping logic 308 in the endpoint device. An equivalent Scalable IOV capable device design treats each QP as an ADI 428, 430, 432 or 434 and uses VDCM software 402 to compose a VDEV using eight ADIs. In this case, the resource remapping functionality is implemented in VDCM 402.
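
The following C sketch illustrates this worked example under assumed APIs: the VDCM composes one VDEV from eight queue-pair ADIs, keeping the QP-to-VDEV mapping in software rather than in on-device resource remapping logic.

```c
#include <stdint.h>

struct adi;                                            /* opaque ADI handle */
struct vdev { struct adi *qp[8]; unsigned nqp; };      /* VDEV backed by 8 QP ADIs */

struct adi *host_driver_alloc_qp_adi(uint32_t pasid);  /* assumed host-driver API */

int vdcm_compose_8qp_vdev(struct vdev *v, uint32_t vm_pasid)
{
    v->nqp = 0;
    for (unsigned i = 0; i < 8; i++) {
        struct adi *qp = host_driver_alloc_qp_adi(vm_pasid);
        if (!qp)
            return -1;          /* real code would release earlier QPs here */
        v->qp[v->nqp++] = qp;   /* the VDCM, not the device, tracks the mapping */
    }
    return 0;
}
```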

FIG. 5 illustrates an example diagram 500 of a logical view of ADIs with a varying number of device backend resources 330, and virtualization software composing virtual device instances 508, 510, 512 with one or more ADIs 520, 522, 524, 526. There are one or more guest partitions such as guest partition 1 502, guest partition 2 504, . . . guest partition J 506, where J is a natural number, being executed by computing platform 101. There are one or more virtual devices (VDEVs) such as virtual device 1 508, virtual device 2 510, . . . virtual device K 512, where K is a natural number, being executed by computing platform 101. Each guest partition may call one or more virtual devices for I/O requests. For example, guest partition 1 502 calls virtual device 1 508, guest partition 2 504 calls virtual device 2 510, and so on to guest partition J 506 calling virtual device K 512. There may be any number of guest partitions. There may be any number of virtual devices. The maximum number of virtual devices being called by any one guest partition is implementation dependent. Within endpoint device hardware (i.e., Scalable IOV device 230), there are one or more ADIs, such as ADI 1 520, ADI 2 522, ADI 3 524, . . . ADI M 526, where M is a natural number. There may be any number of ADIs in Scalable IOV device 230 (i.e., it is implementation dependent), and there are one or more Scalable IOV devices (e.g., network I/O devices 110) in computing platform 101. The number of Scalable IOV devices used in a computing platform is implementation dependent. Each ADI uses one or more device backend resources 330. For example, ADI 1 520 uses backend resource 1 528, ADI 2 522 uses backend resource 2 530, ADI 3 524 uses backend resource 3 532, backend resource 4 534, and backend resource 5 536, and ADI M 526 uses backend resource N 538. There may be any number of backend resources in Scalable IOV device 230. The number of backend resources in a Scalable IOV device is implementation dependent.

Any virtual device 508, 510, 512 may take a slow path or a fast path for I/O requests. For example, virtual device 1 508 calls slow path software emulation 514 or fast path direct mapping 540 to ADI 1 520. Virtual device 1 508 also calls ADI 2 522 via fast path direct mapping 540. Virtual device 2 510 calls slow path software emulation 516 or calls ADI 3 524 via fast path direct mapping 542. Virtual device K 512 calls slow path software emulation 518 or calls ADI M 526 via fast path direct mapping 544.

Unlike SR-IOV VFs, whose requests are tagged with each VF's unique Requester-ID (RID), in embodiments of the present invention requests from all ADIs of a PF are tagged with the RID of the PF. Instead, requests from ADIs are distinguished through a Process Address Space Identifier (PASID) in an end-to-end PASID TLP Prefix. The PCI Express specification defines the Process Address Space Identifier (PASID) in the PASID TLP Prefix of a transaction, which in conjunction with the RID identifies the address space associated with the request.

The definition of the address space targeted by a PASID value is dependent on the Root Complex DMA remapping hardware capability and the programming of such hardware by software. Computing platforms with Intel® Virtualization Technology for Directed I/O, Rev 3.0 or higher, support the Scalable IOV architecture through PASID-granular address translation capability. Depending on the programming of such DMA remapping hardware, the address space targeted by a request with PASID can be a Host Physical Address (HPA), Host Virtual Address (HVA), Host I/O Virtual Address (HIOVA), Guest Physical Address (GPA), Guest Virtual Address (GVA), Guest I/O Virtual Address (GIOVA), etc. All of these address space types can coexist on computing platform 101 for different PASID values, and ADIs from one or more Scalable IOV devices may be configured to use these PASIDs.

When assigning an ADI to an address domain (e.g., VM 422, container, or process), the ADI is configured with the unique PASID of the address domain, and the ADI's memory requests are tagged with the PASID value in the PASID TLP Prefix. If multiple ADIs are assigned to the same address domain, they may be assigned the same PASID. If ADIs belonging to a virtual device (VDEV) assigned to a VM are further mapped to secondary address domains (e.g., application processes) within the VM, each such ADI is assigned a unique PASID corresponding to the secondary address domain. This enables usages such as Shared Virtual Memory (SVM) within a VM, where a guest application process is assigned an ADI and, similar to the nested address translation (GVA to GPA to HPA) for CPU accesses by the guest application, requests from the guest's ADI are also subjected to the same nested translation by the DMA remapping hardware. Depending on the usage model, an ADI may also be allowed to use more than one PASID value, and in this case the semantics of which PASID value to use with which request is Scalable IOV device dependent. For example, an ADI may be configured to access meta-data, commands and completions with one PASID that represents a restricted control domain, while the data accesses are associated with the PASID of the domain to which the ADI is assigned.
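
A short sketch of these PASID assignment rules follows, with host_driver_adi_set_pasid() as an assumed host-driver interface: ADIs assigned to the same domain may share that domain's PASID, while an ADI mapped to a guest process receives that process's unique PASID so the DMA remapping hardware applies the nested translation.

```c
#include <stdint.h>

struct adi;
int host_driver_adi_set_pasid(struct adi *a, uint32_t pasid);  /* assumed */

/* All ADIs of one address domain (e.g., a VM) may share its PASID. */
int assign_adis_to_vm(struct adi **adis, unsigned n, uint32_t vm_pasid)
{
    for (unsigned i = 0; i < n; i++) {
        int rc = host_driver_adi_set_pasid(adis[i], vm_pasid);
        if (rc)
            return rc;   /* every upstream request now carries vm_pasid */
    }
    return 0;
}

/* An ADI mapped to a secondary domain (guest process) gets a unique
 * PASID, enabling SVM within the VM via nested GVA->GPA->HPA translation. */
int assign_adi_to_guest_process(struct adi *a, uint32_t guest_process_pasid)
{
    return host_driver_adi_set_pasid(a, guest_process_pasid);
}
```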

In an embodiment, devices 230 supporting the Scalable IOV architecture support the PASID capability as defined by the PCI Express specification and comply with all associated requirements. Before enabling ADIs on a PF 232, the PASID capability on PF 232 is enabled by software. Before an ADI is activated, the ADI is configured with a PASID value. All upstream memory requests (except Address Translation Service (ATS) translated requests) generated by ADIs are tagged with the assigned PASID value using the PASID TLP Prefix. In an embodiment, ADIs are not able to generate memory requests (except ATS translated requests) without a PASID, or to generate memory requests with a PASID value in the PASID TLP Prefix that is not the ADI's assigned PASID value.

Each ADI's memory mapped I/O (MMIO) 322 registers are hosted within one or more of the PCI Express Base Address Registers (BARs) 320 of the hosting PF 232. Each ADI's MMIO 322 registers are contained in one or more system page size, aligned regions, and these may be contiguous or scattered regions within the PF's MMIO space 322. The association between specific ADIs and the number and location of system page size regions within the PF's MMIO is device-specific. The system page sizes supported by the Scalable IOV device are reported via the Intel Scalable IOV enumeration capability described below. In an embodiment, for Intel® 64-bit computing platforms, the system page size is 4 kilobytes (KBs).

Devices supporting the Scalable IOV architecture partition their ADI MMIO 322 registers into two categories: (a) MMIO registers accessed frequently for fast path operations; and (b) MMIO registers accessed infrequently for slow path (control, configuration, management, etc.) operations. The definition of which operations are designated as slow path versus fast path is device-specific. PF 232 locates registers in these two categories in distinct system page size regions. This enables virtualization software such as host OS/VMM 228 to directly map fast path operations to one or more constituent ADIs while emulating slow path operations in software 514, 516, 518.

In an embodiment, devices implement 64-bit BARs 320 so that the address space above 4 gigabytes (GB) can be used for scaling ADI MMIO 322 resources. Additionally, since non-prefetchable BARs use MMIO space below 4 GB even with 64-bit BARs, in one embodiment devices implement prefetchable 64-bit BARs.

ADIs capable of generating interrupts generate only message signaled interrupts (MSIs) (no legacy interrupts). ADIs do not share interrupt resources/messages with the PF or with another ADI. An ADI may support one or more interrupt messages. For example, an ADI composed of N queues on a PF may support N interrupt messages to distinguish work arrivals or completions for each queue, where N is a natural number.

The Scalable IOV architecture enables device implementations to support a large number of ADIs, and each ADI may use multiple interrupt messages. To support the large interrupt message storage for all ADIs, a device-specific construct called Interrupt Message Storage (IMS) for ADIs 328 is defined. IMS 328 enables devices to store the interrupt messages for ADIs in a device-specific optimized manner without the scalability restrictions of the PCI Express defined MSI-X capability.

Even though the IMS storage organization is device-specific, in one embodiment IMS entries store and generate interrupts using the same interrupt message address and data format as PCI Express MSI-X table entries. Interrupt messages stored in IMS 328 are composed of a DWORD size data payload and a 64-bit address. IMS 328 may also optionally support per-message masking and pending bit status, similar to the per-vector mask and pending bit array in the PCI Express MSI-X capability. In an embodiment, the IMS resource is programmed by the host driver 412.
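
One plausible in-memory representation of an IMS entry following the MSI-X-style message format described above is sketched below in C; the actual IMS layout is device-specific, so this structure is only an assumption.

```c
#include <stdint.h>

/* Hypothetical IMS entry: 64-bit message address, DWORD data payload,
 * plus the optional per-message mask and pending status bits. */
struct ims_entry {
    uint64_t msg_addr;     /* interrupt message address */
    uint32_t msg_data;     /* DWORD payload, as in an MSI-X table entry */
    uint32_t masked  : 1;  /* optional per-message mask */
    uint32_t pending : 1;  /* optional pending status */
    uint32_t rsvd    : 30;
};
```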

PFs hosting the ADIs may support the PCI Express defined MSI or MSI-X capability. Interrupts generated by PF 232 may use the PF's MSI or MSI-X capability 326 as specified by the PCI Express specification, while interrupts generated by ADIs may use the device-specific IMS 328. Specific host OS 150/VMM 180 implementations according to embodiments of the present invention support the use of IMS 328 for the PF's interrupts and/or the use of the PF's MSI-X table for ADI interrupts.

The size, location, and storage format for IMS 328 are device-specific. For example, some devices may implement IMS as on-device storage, while other stateful devices that manage contexts that are saved to and restored from primary memory 130 may implement IMS as part of the context privileged state. In either approach, devices may implement IMS 328 as either one unified storage structure or as de-centralized per-ADI storage structures. If IMS 328 is implemented in host primary memory 130, ADIs may cache IMS entries on the Scalable IOV device. If the Scalable IOV device implements IMS caching, the Scalable IOV device also implements device specific interfaces for the device-specific driver to invalidate the IMS cache entries.

IMS 328 is managed by host driver software 412 and is not made accessible directly from guest or user-mode drivers in guest partitions 502, 504, 506. Within the Scalable IOV device, IMS storage is not directly accessible from the ADIs; instead, the ADIs can request interrupt generation only through the PF's ‘Interrupt Message Generation Logic’. This ensures that ADIs cannot modify IMS contents and that an ADI can indirectly generate interrupts only using IMS entries assigned by host driver software 412 to the corresponding ADI.

On Intel Architecture (IA) based 64-bit computing platforms, message signaled interrupts are issued as DWORD size untranslated memory writes without a PASID TLP Prefix, to address range 0xFEExxxxx. Since all memory requests generated by ADIs include a PASID TLP Prefix while interrupt messages are generated without a PASID TLP Prefix, it is not possible to generate a DMA write to the interrupt message address (0xFEExxxxx on IA based 64-bit computing platforms) through an ADI and cause the platform to interpret the DMA write as an interrupt message.

Operations or functioning of one ADI must not affect the functioning of another ADI or the functioning of PF 232. Every memory request (except ATS translated requests) from an ADI must carry a PASID TLP Prefix using the ADI's assigned PASID value in the PASID TLP Prefix. The PASID identity for an ADI is accessed or modified by privileged software, such as through host driver 412.

Since ADIs on Scalable IOV device 230 are part of PF 232, the PCI Express Access Control Service (ACS) capability is not applicable for isolation between ADIs. Instead, devices disable peer-to-peer access (either internal to the device or at I/O fabric egress) between ADIs and between ADIs and the PF. Independent of Scalable IOV architecture support, PF 232 may support ACS guidelines for isolation across endpoint functions or devices, per the PCI Express specification.

Quality of service (QoS) for ADIs is defined specific to a given Scalable IOV device. ADI QoS attributes are managed by host driver 412 and controlled by VDCM 402 through host driver 412 interfaces.

ADI specific errors are errors that can be attributed to a particular ADI, such as malformed commands or address translation errors. Such errors do not impact the functioning of other ADIs or of PF 232. Handling of ADI specific errors is implemented in device-specific ways.

Each ADI is independently resettable without affecting the operation of other ADIs. However, unlike SR-IOV VFs 214, ADIs do not support the Function Level Reset (FLR) capability. Instead, reset of an ADI is performed through software interfaces to host driver 412 via ADI reset configuration 410 as shown in FIG. 4. To support ADI reset, Scalable IOV devices implement interfaces to abort (e.g., discard) in-flight and accepted operations to the ADI by a specific domain (or PASID). In an embodiment, a VDEV 404 composed out of ADIs may expose a virtual FLR capability that may be emulated by VDCM 402 by requesting host driver 412 to perform the ADI reset for the constituent ADIs 428, 430, 432, or 434 for VDEV 404.
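A minimal sketch of this virtual FLR emulation follows, assuming a hypothetical VDCM-internal structure and a host driver entry point named host_driver_adi_reset(); the actual interfaces are implementation-specific:

    struct vdev {
        int num_adis;
        int adi_ids[8];                 /* handles for constituent ADIs */
    };

    /* Host driver interface; aborts in-flight and accepted operations
     * for the ADI and reports completion status. */
    extern int host_driver_adi_reset(int adi_id);

    static int vdcm_emulate_virtual_flr(struct vdev *vdev)
    {
        for (int i = 0; i < vdev->num_adis; i++) {
            int rc = host_driver_adi_reset(vdev->adi_ids[i]);
            if (rc)
                return rc;              /* ADI reset did not complete */
        }
        return 0;
    }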

An ADI reset ensures that the reset is not reported as complete until the following conditions are satisfied: a) all DMA write operations by the ADI are drained or aborted; b) all DMA read operations by the ADI have completed or aborted; c) all interrupts from the ADI have been generated; d) if the ADI is capable of Address Translation Service (ATS), all ATS requests by the ADI have completed or aborted; and e) if the ADI is capable of Page Request Service (PRS), no more page requests will be generated by the ADI. Additionally, either page responses have been received for all page requests generated by the ADI, or the ADI will discard page responses for any outstanding page requests.

In an embodiment, PFs 232 support Function Level Reset (FLR) and may optionally support additional device-specific global reset controls. A global reset operation and FLR on a PF 232 reset all of its ADIs and return the PF to a state where no ADIs are configured.

In an embodiment, PFs 232 support saving and restoring ADI state to facilitate operations such as live migration and suspend/resume of virtual devices composed from such ADIs. For example, to support ADI suspend, Scalable IOV devices 230 implement interfaces to drain (i.e., complete) in-flight and accepted operations to the ADI by a specific domain (or PASID). In an embodiment, ADI suspend, ADI state save, ADI state restore, and ADI resume from restored state are also implemented through host driver 412 interfaces.

A PF 232 reports support for the Scalable IOV architecture to system software such as VDCM 402 through the host driver 412 interface. If host driver 412 reports support for the Scalable IOV architecture, host driver 412 supports an extended set of interfaces to enumerate, provision, instantiate, and manage ADIs on the PF. System software such as VDCM 402 performs all Scalable IOV specified operations on Scalable IOV device 230 through host driver 412.
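One way to picture this extended interface is as a table of operations the VDCM invokes on the host driver; the following C sketch uses hypothetical names, since the actual interface is system-software specific:

    struct siov_host_driver_ops {
        int  (*enumerate_adis)(unsigned int *num_available);
        int  (*provision_adi)(unsigned int backend_resources);
        int  (*instantiate_adi)(unsigned int pasid, int *adi_id);
        int  (*reset_adi)(int adi_id);
        void (*release_adi)(int adi_id);
    };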

Additionally, in one embodiment, a PCI Express Designated Vendor Specific Extended Capability (DVSEC) is defined for system software such as VDCM 402 and software tools to detect devices supporting the Scalable IOV architecture, without host driver dependency. Host driver 412 is still responsible for enabling the Scalable IOV architecture and related operations through system software specific interfaces.

FIG. 6 illustrates a diagram of an example Scalable IOV DVSEC structure 600. The fields up to offset 0xA are the standard DVSEC capability header information. Refer to the PCI Express DVSEC header for a detailed description of these fields. The remaining fields are described below.

Function Dependency (DEP) Link (read-only (RO)) field 602 is at offset 0xA and has a size of one byte. The programming model for a device may have vendor-specific dependencies between sets of functions. Function Dependency Link field 602 is used to describe these dependencies. This field describes dependencies between PFs 232. ADI dependencies are the same as the dependencies of their PFs. If a PF 232 is independent from other PFs of a Scalable IOV device 230, this field contains the PF's own Function Number. If a PF is dependent on other PFs of a Scalable IOV device, this field contains the Function Number of the next PF in the same Function Dependency List (FDL). The last PF in an FDL contains the Function Number of the first PF in the FDL.

Dependencies between PFs are further described by the Flags field 604. Flags field 604 (read only) is at offset 0xB and has a size of one byte. In an embodiment, Flags field 604 includes a homogeneous (H) flag in bit 0 of the byte, and bits 1 through 7 are reserved. When the H flag is reported as set, the H flag indicates that all PFs in the FDL must be enabled (in a device-specific manner) for Scalable IOV operation. If some but not all of the PFs in the FDL are enabled for Scalable IOV operation, the behavior is undefined (i.e., one PF cannot be in Scalable IOV operation mode and another in SR-IOV operation mode if the H flag is reported as set). If the H flag is not set, PFs in the FDL can be in different modes.

Supported Page Sizes (read only) field 606 is at offset 0xC and has a size of four bytes. Supported Page Sizes field 606 indicates the page sizes supported by PF 232. The PF supports a page size of 2^(n+12) if bit n is set. For example, if bit 0 is set, the PF supports 4 KB pages. The page size describes the minimum alignment requirements for ADI MMIO 322 pages so that they can be independently assigned to different address domains. In an embodiment, PFs are required to support 4 KB page sizes. PFs may support additional system page sizes for broad compatibility across host platform architectures.

System Page Size (read-write (RW)) field 608 is at offset 0x10 and has a size of four bytes. System Page Size field 608 defines the page size the system uses to map the ADIs' MMIO 322 pages. Software sets the value of System Page Size to one of the page sizes set in the Supported Page Sizes field. As with Supported Page Sizes, if bit n is set in System Page Size, the ADIs associated with this PF use a page size of 2^(n+12). For example, if bit 1 is set, the device uses an 8 KB page size. The behavior is undefined if System Page Size is zero, if more than one bit is set, or if a bit is set in the System Page Size field that is not set in Supported Page Sizes.

When System Page Size field 608 is written, PF 232 aligns all ADI MMIO 322 resources on system page size boundaries. System Page Size must be configured before setting the Memory Space Enable bit in the PCI command register of the PF. The behavior is undefined if System Page Size is modified after the Memory Space Enable bit is set. The default value is 00000001h, indicating a system page size of 4 KB.

Capabilities (read only) field 610 is at offset 0x14 and has a size of four bytes. In an embodiment, Capabilities field 610 includes an IMS Support flag in bit 0, and bits 1 to 31 are reserved. The IMS Support flag indicates support for Interrupt Message Storage (IMS) in the device. When the IMS Support flag is 0, IMS is not supported by the device. When the IMS Support flag is 1, IMS is supported by the device.
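Gathering the fields described above, the DVSEC layout may be sketched as a packed C structure using the stated offsets and sizes, together with a helper for the 2^(n+12) page size encoding; this is illustrative only, assuming a little-endian host, and is not a normative header:

    #include <stdint.h>

    #pragma pack(push, 1)
    struct siov_dvsec {
        uint8_t  dvsec_header[0xA];  /* standard DVSEC capability header   */
        uint8_t  dep_link;           /* 0xA: Function Dependency Link (RO) */
        uint8_t  flags;              /* 0xB: bit 0 = homogeneous (H) flag  */
        uint32_t supported_pgsz;     /* 0xC: bit n => 2^(n+12) supported   */
        uint32_t system_pgsz;        /* 0x10: exactly one bit set (RW)     */
        uint32_t capabilities;       /* 0x14: bit 0 = IMS Support flag     */
    };
    #pragma pack(pop)

    /* Bit n in the page size fields encodes a page size of 2^(n+12)
     * bytes: bit 0 is 4 KB, bit 1 is 8 KB, and so on. */
    static inline uint64_t siov_page_size(unsigned int n)
    {
        return 1ull << (n + 12);
    }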

If virtualization software (such as VDCM 402) does not support IMS use by the PF itself (i.e., IMS use is supported only for the PF's ADIs), when the PF is directly assigned to a domain, for compatibility, virtualization software may expose a virtual Scalable IOV capability to the domain with the IMS Support flag reported as 0.

In an embodiment, the Scalable IOV architecture relies on the following platform level capabilities: a) support for the PCI Express PASID TLP Prefix in supporting Root Ports (RPs), Root Complex (RC), and DMA remapping hardware units (refer to the PCI Express Revision 4.0 specification for details on PASID TLP Prefix support); and b) PASID-granular address translation by DMA remapping hardware, such as that defined by scalable mode address translation in Intel® Virtualization Technology (Intel® VT) for Directed I/O (Intel® VT-d), Revision 3.0 or higher.

Scalable mode address translation as defined by Intel® VT-d involves a three-stage address translation. Other embodiments may use other methods. First, the Requester-ID (RID) (Bus/Device/Function numbers) in upstream requests is used to consult the Root and Context structures that specify translation behavior at RID (PF or SR-IOV VF) granularity. The context structures refer to PASID structures. Second, if the request includes a PASID TLP Prefix, the PASID value from the TLP prefix is used to consult the PASID structures that specify translation behavior at PASID (target address domain) granularity. If the request is without a PASID TLP Prefix, the PASID value programmed by software in the Context structure is used instead. For each PASID, the respective PASID structure entry can be programmed to specify first-level, second-level, pass-through, or nested translation functions, along with references to first-level and second-level page-table structures. Finally, the address in the request is subject to address translation using the first-level, second-level, or both page-table structures, depending on the type of translation function.
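The three stages can be summarized in C-like pseudocode; the structures and the context_lookup()/pasid_lookup() helpers are hypothetical simplifications of the VT-d scalable mode structures, shown only to make the flow concrete:

    #include <stdint.h>

    struct pasid_entry   { void *flpt, *slpt; int mode; };
    struct context_entry { uint32_t default_pasid; /* ... */ };

    extern struct context_entry *context_lookup(uint16_t rid);
    extern struct pasid_entry   *pasid_lookup(struct context_entry *ctx,
                                              uint32_t pasid);

    struct pasid_entry *translate_ctx(uint16_t rid, int has_pasid_prefix,
                                      uint32_t pasid_from_tlp)
    {
        /* Stage 1: the RID selects the Root and Context structures. */
        struct context_entry *ctx = context_lookup(rid);

        /* Stage 2: the PASID (from the TLP prefix, or the value
         * programmed in the context entry) selects a PASID entry. */
        uint32_t pasid = has_pasid_prefix ? pasid_from_tlp
                                          : ctx->default_pasid;
        struct pasid_entry *pe = pasid_lookup(ctx, pasid);

        /* Stage 3: pe->mode selects first-level, second-level, nested,
         * or pass-through translation via pe->flpt and pe->slpt. */
        return pe;
    }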

The PASID granular address translation enables upstream requests from each ADI on a PF to have a unique address translation. Any such ADIs 428, 430, 432, or 434 on a PF 232 can be used by VDCM 402 to compose virtual devices 404 that may be assigned to any type of address domain (such as the guest physical address space of a VM or machine container, an I/O virtual address for a bare-metal container, a shared CPU virtual address for an application process, or such guest containers or processes operating within a VM).

For interrupt isolation across devices, host system software, such as Host OS 150, VMM 180, and/or VDCM 402, enables interrupt remapping and uses the remappable interrupt message format for all interrupt messages programmed in MSI, MSI-X 326, or IMS 328 on the device. Refer to the Intel Virtualization Technology for Directed I/O specification for details on interrupt remapping.

In various embodiments, computing platforms supporting the Scalable IOV architecture also support the Posted Interrupts capability. Posted Interrupts enables scalable interrupt virtualization by enabling interrupt messages to operate in guest interrupt vector space without consuming host processor interrupt vectors. It additionally enables direct delivery of virtual interrupts to active virtual processors without hypervisor processing overheads. Refer to the Intel Virtualization Technology for Directed I/O architecture specification for details on posted interrupts. Posted interrupt operation is transparent to endpoint devices.

FIG. 7 illustrates an example high-level translation structure organization 700 for scalable mode address translation. In an embodiment, scalable mode address translation is accomplished using scalable mode root table 702. Scalable mode root table 702 includes a plurality of entries. In this example, the number of entries in the scalable mode root table is 256, the maximum number of buses possible in computing platform 101. An example entry N 708 in scalable mode root table 702 points to an entry in scalable mode lower context table 710 and to an entry in scalable mode upper context table 712. In an embodiment, there are 128 entries in scalable mode lower context table 710, numbered as device (DEV)=0 entry 714 up to device=15 entry 716 (each of the 16 devices has eight entries). Each entry in the scalable mode lower context table specifies a function. Function IDs are numbers between 0 and 7 in one embodiment. In an embodiment, there are 128 entries in scalable mode upper context table 712, numbered as device=16 entry 718 up to device=31 entry 720 (each of the 16 devices has eight entries). Each entry in the scalable mode upper context table specifies a function. A selected entry in either scalable mode lower context table 710 or scalable mode upper context table 712 points to scalable mode PASID directory 722. In an embodiment, there are 2^14 entries in scalable mode PASID directory 722, from entry number 0 724 to entry number 2^14−1 726. The index into the directory is formed from bits 6 through 19 of a PASID in one embodiment. A selected entry in scalable mode PASID directory 722 points to a selected entry in scalable mode PASID table 728. In an embodiment, there are 64 entries in scalable mode PASID table 728, from entry number 0 730 to entry number 63 732. The index into scalable mode PASID table 728 is formed from bits 0 through 5 of a PASID in one embodiment. A selected entry in scalable mode PASID table 728 points to first level page table structures 734 and to second level page table structures 736.
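A small C sketch of the index arithmetic implied by FIG. 7: bits 6 through 19 of a PASID select one of the 2^14 scalable mode PASID directory entries, and bits 0 through 5 select one of the 64 entries in the scalable mode PASID table that the directory entry points to (helper names are illustrative):

    #include <stdint.h>

    static inline uint32_t pasid_dir_index(uint32_t pasid)
    {
        return (pasid >> 6) & 0x3FFF;   /* bits 6 through 19 */
    }

    static inline uint32_t pasid_table_index(uint32_t pasid)
    {
        return pasid & 0x3F;            /* bits 0 through 5  */
    }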

FIG. 8 illustrates an example diagram of virtual device composition 800. Host OS 150 includes VDCM 402 to use VDEV template 404 to instantiate a virtual device VDEV 1 816 for operation with guest driver 424 in guest VM 422. VDEV 404 includes PASID 802 to uniquely identify the VDEV instance (e.g., VDEV 1 816). VDEV 404 includes virtual PCI configuration space (VCS) 804 to define parameters for the VDEV instance, such as capabilities (e.g., MSI-X, BARs, and other PCI capabilities), error handling, quality of service (QoS) settings, and reset indicators. VDEV 404 includes MSI-X table 806 to store a plurality of MSI-X interrupt entries. VDEV 404 includes MMIO fields for mapping addresses of registers in PF 232 for fast path ADI MMIO 808, software emulated, hypervisor intercepted (HI) MMIO 810, and memory-backed (MB) MMIO 812. ADI MMIO 808 provides fast path resources for VDEV 404. VM resource manager 416 provides services to VDCM 402, assigns PASIDs to VDEVs, and triggers VDEV composition. IOMMU driver 813 configures translation tables for IOMMU with scalable IOV extensions 414 for a PASID assigned to VDEV 404.

Physical function (PF) 232 includes PF base address registers (BARs) 320, which comprise one or more ADI MMIO registers 322, denoted ADI MMIO 828, ADI MMIO 830, . . . ADI MMIO 832, ADI MMIO 834 in FIG. 8. A Base Address Register (BAR) is used to specify how much memory a device (e.g., a PF) wants to have mapped into primary memory 130, and after device enumeration, a BAR holds the (base) address where the mapped memory block begins. ADI MMIO registers 828, 830, . . . 832, 834 are used by VDEVs to directly communicate with hardware queues 436, 438, . . . 440, 442. PF 232 includes PF config 324 to define PCI configuration parameters for the PF, such as device vendor ID, BARs, error reporting, etc. PF 232 stores PASIDs for ADIs 814, and PF MSI-X 326 provides the MSI-X capability as defined by the PCI Express Base Specification. IMS for ADIs 328 enables PF 232 to store interrupt messages for ADIs in a PF-specific optimized manner without the scalability restrictions of the PCI Express defined MSI-X capability. In one embodiment, interrupt messages are used to indicate that data (e.g., packets) are available in queues 436, 438, . . . 440, 442 to be processed by VDEV 1 816.

The life cycle of a VDEV is divided into four stages: 1) VM and VDEV definition; 2) VM instantiation and VDEV composition; 3) runtime VDEV resource modification; and 4) VM shutdown and VDEV decomposition.

At the first stage, a user (such as a system administrator, a remotely run script, or other control mechanism) selects resources for a VM to be instantiated (e.g., guest VM 422 and other guest VMs). VM resources include virtual processors, memory, and physical devices such as network devices, storage devices, graphics processing units (GPUs), field programmable gate arrays (FPGAs), accelerator devices, etc. The user also selects, depending on a VDEV type (such as network, storage, GPU, FPGA, accelerator, etc.), resources for each VDEV to be instantiated. For example, for a network VDEV, the user may select initial resources such as a number of receive (Rx) and transmit (Tx) queues and a number of IMS entries, and a maximum number of resources of each type that a VDEV can request during the lifespan of the VDEV instance. In one example, a network VDEV is instantiated with two ADIs and N entries in IMS for ADIs 328, where N is a natural number, but additional ADIs and IMS entries may be requested and added during runtime due to processing workloads. In an embodiment, the number of ADIs for a VDEV is the same as the number of interrupt entries in IMS for ADIs 328. That is, there is a one-to-one correspondence between an ADI and an interrupt entry in IMS for ADIs 328.
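A VDEV resource template for this first stage might be captured as follows; the structure and field names are illustrative assumptions rather than part of the architecture:

    struct vdev_template {
        unsigned int initial_adis;        /* e.g., 2 for a network VDEV    */
        unsigned int initial_ims_entries; /* one per ADI in this example   */
        unsigned int max_adis;            /* ceiling over the VDEV's life  */
        unsigned int max_ims_entries;
    };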

At the second stage, a VM is instantiated (e.g., a guest VM 422) and the associated one or more VDEVs are composed by VDCM 402 along with S-IOV capable PF host driver 412. FIG. 9 illustrates an example flow diagram 900 of virtual device composition. After a VDEV (such as VDEV 1 816) is instantiated by VDCM 402, at block 902 VDCM 402 creates virtual PCI configuration (config) space (VCS) 804 in VDEV 1 816. VDCM 402 then populates values into VCS 804 fields in the following three blocks. At block 904, VDCM 402 assigns a device vendor ID of PF 232 to VCS 804 in VDEV 1 816. This associates PF 232 with VDEV 816. At block 906, VDCM 402 prepares the MSI-X BAR in VCS 804 of VDEV 1 816 by setting locations in primary memory 130 representing the VDEV. At block 908, VDCM 402 prepares one or more MMIO BARs in VCS 804 of VDEV 1 816 by setting locations in primary memory. In an embodiment, MMIO memory space is provisioned for a VDEV for the maximum amount of resources the VDEV can have in the VDEV's lifetime, but initially the VDEV is configured with fewer resources. At block 910, VDCM 402 gets one or more ADIs and associated register pages from PF host driver 412 and maps the ADIs and associated register pages to VDEV MMIO space. At block 912, VDCM 402 designates MMIO space for software emulated HI MMIO registers 810 and memory backed MB MMIO registers 812. At block 914, VDCM 402 assigns the instantiated VDEV (e.g., VDEV 1 816) to guest VM 422.
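The composition flow of FIG. 9 may be sketched as a sequence of calls, one per numbered block; all function names here are hypothetical stand-ins for VDCM-internal operations (struct vdev and struct guest_vm are assumed declared elsewhere):

    int vdcm_compose_vdev(struct vdev *vdev, struct guest_vm *vm)
    {
        create_virtual_config_space(vdev);       /* block 902 */
        assign_device_vendor_id(vdev);           /* block 904 */
        prepare_msix_bar(vdev);                  /* block 906 */
        prepare_mmio_bars(vdev);                 /* block 908: sized for
                                                    maximum resources    */
        map_adi_register_pages(vdev);            /* block 910: fast path */
        designate_emulated_mmio(vdev);           /* block 912: HI and MB */
        return assign_vdev_to_vm(vdev, vm);      /* block 914 */
    }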

FIG. 10 illustrates an example diagram of virtual device composition 1000 using physical function ADIs and Interrupt Message Services (IMS) resources. For example, a network VDEV is instantiated and composed of ADIs, where each ADI represents a set of receive/transmit queues, and PF IMS resources are used to communicate with the VDEV's MSI-X configuration. Additionally, VDEV MMIO is configured by the VDCM for emulated HI MMIO registers 810 (e.g., a slow path interface) where high performance processing is not required. In an embodiment, the behavior of the emulated HI MMIO registers 810 is implemented by VDCM 402 and PF host driver 412. Guest VM 422 accesses the BAR for MMIO 1002 in VDEV config space (VCS 804). The BAR for MMIO 1002 points to a location in VDEV memory pages 1006. VDEV memory pages 1006 are resident in primary memory 130 on host computing platform 101. The location in the VDEV memory pages 1006 is mapped to a location in physical function address space represented as PF and ADI fast path register pages 1012 on PF 232. Software emulated HI MMIO or memory backed register pages 1010 also point to VDEV memory pages 1006. Guest VM 422 also accesses the BAR for MSI-X 1004 in VCS 804. The BAR for MSI-X 1004 points to a location in VDEV MSI-X table 1008, which is an instance of MSI-X table 806. The location in VDEV MSI-X table 1008 points to a location in PF and IMS MSI-X vectors 1004. In an embodiment, PF and IMS MSI-X vectors 1004 is the same as PF MSI-X 326.

In an embodiment, IMS is located in PF MMIO space, which can include MSI-X table entries of both the PF and VDEVs. One benefit of IMS is that, unlike the MSI-X table, IMS is not restricted to a maximum size of 2K entries. Since IMS is in PF MMIO space, IMS can support many MSI-X-like entries (e.g., more than 2K).

At the third stage, the resources available to a VDEV may be modified during runtime. FIG. 11 illustrates an example flow diagram 1100 of virtual device resource modification. In some cases, the user may select the maximum amount of resources (e.g., ADIs, IMS entries) for a VDEV, but the VDEV may be initially composed with only a minimum amount of resources. In embodiments, the values for maximum and minimum are implementation dependent. During runtime, depending on processing workload requirements, an event may be triggered to request the dynamic addition or modification of resources to an existing VDEV. In an embodiment, the event is triggered by the actions of a system administrator. In another embodiment, the event is triggered programmatically (e.g., by a controller component).

At block 1102, VDCM 402 receives an external request to increase or modify VDEV resources. External in this situation refers to an entity outside of VDCM 402, guest VM 422, and VDEV 1 816. At block 1104, in response to receiving the request, VDCM 402 requests additional ADIs and IMS interrupt entries from PF host driver 412. At block 1106, VDCM 402 enables new MMIO BAR mappings. At block 1108, VDCM 402 notifies guest driver 424 in guest VM 422 of newly available resources.
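In the same style, the runtime modification flow of FIG. 11 may be sketched as follows, with hypothetical names mirroring blocks 1104 through 1108 (block 1102 is the arrival of the external request itself):

    int vdcm_grow_vdev(struct vdev *vdev, unsigned int extra_adis)
    {
        /* block 1104: request additional ADIs and IMS entries. */
        if (request_adis_and_ims(vdev, extra_adis) != 0)
            return -1;
        enable_new_mmio_mappings(vdev);          /* block 1106 */
        notify_guest_driver(vdev);               /* block 1108 */
        return 0;
    }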

At the fourth stage, the guest VM may be shut down and the guest VM's VDEVs decomposed. FIG. 12 illustrates an example flow diagram 1200 of virtual device decomposition. This processing is performed for each VDEV in a guest VM being shut down. At block 1202, VDCM 402 starts a function level reset (FLR) as described in the PCI Express specification for a VDEV (e.g., VDEV 1 816). At block 1204, VDCM 402 un-maps the VDEV's MMIO BARs from the guest VM's memory space. At block 1206, VDCM 402 returns ADIs and IMS entries to PF host driver 412. PF host driver 412 makes these ADIs available for reuse by other VDEVs. In the case of network VDEVs, the PF host driver disables the Rx/Tx queues 436, 438, . . . 440, 442 and makes these queues available for use by other VDEVs. PF host driver 412 also makes the corresponding IMS entries available for use by other VDEVs. PF host driver 412 also frees up internal resources provisioned for the VDEV.
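The decomposition flow of FIG. 12 may likewise be sketched with hypothetical names, run once per VDEV of the guest VM being shut down:

    void vdcm_decompose_vdev(struct vdev *vdev, struct guest_vm *vm)
    {
        start_vdev_flr(vdev);                    /* block 1202 */
        unmap_mmio_bars(vdev, vm);               /* block 1204 */
        return_adis_and_ims(vdev);               /* block 1206: ADIs, queues,
                                                    and IMS entries become
                                                    available for reuse    */
    }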

FIG. 13 illustrates an example of a storage medium 1300. Storage medium 1300 may comprise an article of manufacture. In some examples, storage medium 1300 may include any non-transitory computer readable medium or machine readable medium, such as an optical, magnetic, or semiconductor storage. Storage medium 1300 may store various types of computer executable instructions, such as instructions 1302 to implement logic flows of S-IOV architecture components, including those described in FIGS. 9, 11, and 12. Examples of a computer readable or machine-readable storage medium may include any tangible media capable of storing electronic data, including volatile memory or non-volatile memory, removable or non-removable memory, erasable or non-erasable memory, writeable or re-writeable memory, and so forth. Examples of computer executable instructions may include any suitable type of code, such as source code, compiled code, interpreted code, executable code, static code, dynamic code, object-oriented code, visual code, and the like. The examples are not limited in this context.

FIG. 14 illustrates an example computing platform 1400. In some examples, as shown in FIG. 14, computing platform 1400 may include a processing component 1402, other platform components 1404, and/or a communications interface 1406.

According to some examples, processing component 1402 may execute processing operations or logic for instructions stored on storage medium 1300. Processing component 1402 may include various hardware elements, software elements, or a combination of both. Examples of hardware elements may include devices, logic devices, components, processors, microprocessors, circuits, processor circuits, circuit elements (e.g., transistors, resistors, capacitors, inductors, and so forth), integrated circuits, application specific integrated circuits (ASICs), programmable logic devices (PLDs), digital signal processors (DSPs), field programmable gate arrays (FPGAs), memory units, logic gates, registers, semiconductor devices, chips, microchips, chip sets, and so forth. Examples of software elements may include software components, programs, applications, computer programs, application programs, device drivers, system programs, software development programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, application program interfaces (APIs), instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination thereof. Determining whether an example is implemented using hardware elements and/or software elements may vary in accordance with any number of factors, such as desired computational rate, power levels, heat tolerances, processing cycle budget, input data rates, output data rates, memory resources, data bus speeds, and other design or performance constraints, as desired for a given example.

In some examples, other platform components 1404 may include common computing elements, such as one or more processors, multi-core processors, co-processors, memory units, chipsets, controllers, peripherals, interfaces, oscillators, timing devices, video cards, audio cards, multimedia input/output (I/O) components (e.g., digital displays), power supplies, and so forth. Examples of memory units may include without limitation various types of computer readable and machine readable storage media in the form of one or more higher speed memory units, such as read-only memory (ROM), random-access memory (RAM), dynamic RAM (DRAM), Double-Data-Rate DRAM (DDRAM), synchronous DRAM (SDRAM), static RAM (SRAM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), and types of non-volatile memory such as 3-D cross-point memory that may be byte or block addressable. Non-volatile types of memory may also include other types of byte or block addressable non-volatile memory such as, but not limited to, multi-threshold level NAND flash memory, NOR flash memory, single or multi-level PCM, resistive memory, nanowire memory, FeTRAM, MRAM that incorporates memristor technology, STT-MRAM, or a combination of any of the above. Other types of computer readable and machine-readable storage media may also include magnetic or optical cards, an array of devices such as Redundant Array of Independent Disks (RAID) drives, solid state memory devices (e.g., USB memory), solid state drives (SSDs), and any other type of storage media suitable for storing information.

In some examples, communications interface 1406 may include logic and/or features to support a communication interface. For these examples, communications interface 1406 may include one or more communication interfaces that operate according to various communication protocols or standards to communicate over direct or network communication links or channels. Direct communications may occur via use of communication protocols or standards described in one or more industry standards (including progenies and variants), such as those associated with the PCIe specification. Network communications may occur via use of communication protocols or standards such as those described in one or more Ethernet standards promulgated by IEEE. For example, one such Ethernet standard may include IEEE 802.3. Network communication may also occur according to one or more OpenFlow specifications, such as the OpenFlow Switch Specification.

The components and features of computing platform 1400, including logic represented by the instructions stored on storage medium 1300, may be implemented using any combination of discrete circuitry, ASICs, logic gates, and/or single chip architectures. Further, the features of computing platform 1400 may be implemented using microcontrollers, programmable logic arrays, and/or microprocessors, or any combination of the foregoing where suitably appropriate. It is noted that hardware, firmware, and/or software elements may be collectively or individually referred to herein as “logic” or “circuit.”

It should be appreciated that the exemplary computing platform 1400 shown in the block diagram of FIG. 14 may represent one functionally descriptive example of many potential implementations. Accordingly, division, omission, or inclusion of block functions depicted in the accompanying figures does not imply that the hardware components, circuits, software, and/or elements for implementing these functions would necessarily be divided, omitted, or included in embodiments.

Various examples may be implemented using hardware elements, software elements, or a combination of both. In some examples, hardware elements may include devices, components, processors, microprocessors, circuits, circuit elements (e.g., transistors, resistors, capacitors, inductors, and so forth), integrated circuits, ASICs, programmable logic devices (PLDs), digital signal processors (DSPs), FPGAs, memory units, logic gates, registers, semiconductor devices, chips, microchips, chip sets, and so forth. In some examples, software elements may include software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, application program interfaces (APIs), instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination thereof. Determining whether an example is implemented using hardware elements and/or software elements may vary in accordance with any number of factors, such as desired computational rate, power levels, heat tolerances, processing cycle budget, input data rates, output data rates, memory resources, data bus speeds, and other design or performance constraints, as desired for a given implementation.

Some examples may include an article of manufacture or at least one computer-readable medium. A computer-readable medium may include a non-transitory storage medium to store logic. In some examples, the non-transitory storage medium may include one or more types of computer-readable storage media capable of storing electronic data, including volatile memory or non-volatile memory, removable or non-removable memory, erasable or non-erasable memory, writeable or re-writeable memory, and so forth. In some examples, the logic may include various software elements, such as software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, APIs, instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination thereof.

Some examples may be described using the expression “in one example” or “an example” along with their derivatives. These terms mean that a particular feature, structure, or characteristic described in connection with the example is included in at least one example. The appearances of the phrase “in one example” in various places in the specification are not necessarily all referring to the same example.

Included herein are logic flows or schemes representative of example methodologies for performing novel aspects of the disclosed architecture. While, for purposes of simplicity of explanation, the one or more methodologies shown herein are shown and described as a series of acts, those skilled in the art will understand and appreciate that the methodologies are not limited by the order of acts. Some acts may, in accordance therewith, occur in a different order and/or concurrently with other acts from that shown and described herein. For example, those skilled in the art will understand and appreciate that a methodology could alternatively be represented as a series of interrelated states or events, such as in a state diagram. Moreover, not all acts illustrated in a methodology may be required for a novel implementation.

A logic flow or scheme may be implemented in software, firmware, and/or hardware. In software and firmware embodiments, a logic flow or scheme may be implemented by computer executable instructions stored on at least one non-transitory computer readable medium or machine readable medium, such as an optical, magnetic, or semiconductor storage. The embodiments are not limited in this context.

Some examples are described using the expressions “coupled” and “connected” along with their derivatives. These terms are not necessarily intended as synonyms for each other. For example, descriptions using the terms “connected” and/or “coupled” may indicate that two or more elements are in direct physical or electrical contact with each other. The term “coupled,” however, may also mean that two or more elements are not in direct contact with each other, but yet still co-operate or interact with each other.

It is emphasized that the Abstract of the Disclosure is provided to comply with 37 C.F.R. Section 1.72(b), requiring an abstract that will allow the reader to quickly ascertain the nature of the technical disclosure. It is submitted with the understanding that it will not be used to interpret or limit the scope or meaning of the claims. In addition, in the foregoing Detailed Description, it can be seen that various features are grouped together in a single example for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed examples require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive subject matter lies in less than all features of a single disclosed example. Thus, the following claims are hereby incorporated into the Detailed Description, with each claim standing on its own as a separate example. In the appended claims, the terms “including” and “in which” are used as the plain-English equivalents of the respective terms “comprising” and “wherein,” respectively. Moreover, the terms “first,” “second,” “third,” and so forth, are used merely as labels, and are not intended to impose numerical requirements on their objects.

What is claimed is:
1. A method comprising: instantiating a virtual machine; instantiating a virtual device to transmit data to and receive data from assigned resources of a shared physical device; and assigning the virtual device to the virtual machine, the virtual machine to transmit data to and receive data from the physical device via the virtual device.
2. The method of claim 1, wherein assigning the virtual device to the virtual machine comprises exposing the virtual device to the virtual machine as a virtual peripheral component interconnect (PCI) express enumerated device.
3. The method of claim 1, wherein the assigned resources comprise shared physical device resources assigned to the virtual device for data transfers.
4. The method of claim 3, wherein the shared physical device comprises a network controller device, the shared physical device resources comprise receive and transmit queues to store the data, and the data comprises packets.
5. The method of claim 1, wherein the shared physical device comprises a storage controller device.
6. The method of claim 1, wherein instantiating a virtual device comprises: assigning an identifier of the shared physical device to the virtual device; preparing a base address register for message signaling interrupts for the virtual device; preparing one or more memory-mapped input/output (MMIO) base address registers; getting one or more assigned resources and associated register pages and mapping the assigned resources and associated register pages to MMIO memory space for a fast path interface to the virtual device; and designating MMIO memory space for emulated and memory-backed registers for a slow path interface to the virtual device.
7. The method of claim 6, comprising decomposing the virtual device by starting a function level reset of the virtual device and un-mapping the assigned resources and associated register pages from MMIO memory space for the virtual device.
8. The method of claim 1, comprising assigning additional resources to the virtual device based at least in part on a request received during runtime.
9. At least one tangible machine-readable medium comprising a plurality of instructions that in response to being executed by a processor cause the processor to: instantiate a virtual machine; instantiate a virtual device to transmit data to and receive data from assigned resources of a shared physical device; and assign the virtual device to the virtual machine, the virtual machine to transmit data to and receive data from the physical device via the virtual device.
10. The at least one tangible machine-readable medium of claim 9, wherein instructions to assign the virtual device to the virtual machine comprise instructions to expose the virtual device to the virtual machine as a virtual peripheral component interconnect (PCI) express enumerated device.
11. The at least one tangible machine-readable medium of claim 9, wherein the assigned resources comprise shared physical device resources assigned to the virtual device for data transfers.
12. The at least one tangible machine-readable medium of claim 9, wherein instructions to instantiate a virtual device comprise instructions to: assign an identifier of the shared physical device to the virtual device; prepare a base address register for message signaling interrupts for the virtual device; prepare one or more memory-mapped input/output (MMIO) base address registers; get one or more assigned resources and associated register pages and map the assigned resources and associated register pages to MMIO memory space for a fast path interface to the virtual device; and designate MMIO memory space for emulated and memory-backed registers for a slow path interface to the virtual device.
13. The at least one tangible machine-readable medium of claim 9, comprising instructions to decompose the virtual device by starting a function level reset of the virtual device and un-mapping the assigned resources and associated register pages from MMIO memory space for the virtual device.
14. The at least one tangible machine-readable medium of claim 9, comprising instructions to assign additional resources to the virtual device based at least in part on a request received during runtime.
15. An apparatus comprising: a virtual machine; and a virtual device composition module coupled to the virtual machine to instantiate a virtual device to transmit data to and receive data from assigned resources of a shared physical device; and to assign the virtual device to the virtual machine, the virtual machine to transmit data to and receive data from the physical device via the virtual device.
16. The apparatus of claim 15, comprising the virtual device composition module to assign the virtual device to the virtual machine by exposing the virtual device to the virtual machine as a virtual peripheral component interconnect (PCI) express enumerated device.
17. The apparatus of claim 15, wherein the assigned resources comprise shared physical device resources assigned to the virtual device for data transfers.
18. The apparatus of claim 17, wherein the shared physical device comprises a network controller device, the shared physical device resources comprise receive and transmit queues to store the data, and the data comprises packets.
19. The apparatus of claim 15, wherein the virtual device composition module is to: assign an identifier of the shared physical device to the virtual device; prepare a base address register for message signaling interrupts for the virtual device; prepare one or more memory-mapped input/output (MMIO) base address registers; get one or more assigned resources and associated register pages and map the assigned resources and associated register pages to MMIO memory space for a fast path interface to the virtual device; and designate MMIO memory space for emulated and memory-backed registers for a slow path interface to the virtual device.
20. The apparatus of claim 19, comprising the virtual device composition module to decompose the virtual device by starting a function level reset of the virtual device and un-mapping the assigned resources and associated register pages from MMIO memory space for the virtual device.
21. The apparatus of claim 15, comprising the virtual device composition module to assign additional resources to the virtual device based at least in part on a request received during runtime.